Final Project

Group 1:

  • Hiba Awan

  • Nathania Stephens

Abstract

This research project investigates Fairfax County Police Department (FCPD) data to derive law enforcement and crime patterns. The objective of this project is to analyze and interpret the publicly available data on arrests, citations, and warnings through visualization and statistical techniques. Doing so helps quantify and visualize crime and law enforcement trends throughout Fairfax County for 2023.

The methods used in this project involved extensive data cleaning and feature creation to calculate and visualize outcome frequencies, as well as citation and warning rates across various temporal and geospatial variables. Visual illustrations were generated using R, showcasing these rates over time. Statistical analysis included a Chi-Squared Test to assess association and Binary Logistic Regression models to determine the statistical significance and effects of factors such as gender, time, and location on law enforcement outcome. The model provided odds ratios for variables such as hour of day and day of month to quantify the odds of a citation or warning.

Key findings showed a significant temporal influence on citation outcome. The analysis showed higher citation odds during morning rush hour and lower citation odds in certain districts. These findings emphasize the importance of time and space for understanding law enforcement trends and highlight their essential role in applying predictive methods.

Overall this project provides data-driven insight into FCPD law enforcement and crime patterns and how they are strongly influenced by temporal and geospatial variables. For future analyses, these findings coupled with more structured data on the exact offense type would very likely provide greater predictability and deeper understanding of law enforcement outcomes.

Introduction

Motivation

In 2023, there were over 30,000 arrests and close to 65,000 citations in Fairfax County. The county includes areas such as Centreville, Chantilly, Herndon, Reston, Tysons Corner, McLean, Merrifield, George Mason, Annandale, Burke, Springfield, Alexandria, and Lorton, to name a few. If you live, work, or study in these areas, then this project should be of interest to you. This project aims to inform Fairfax County residents of violation, warning, and arrest trends and to provide statistical insights that are applicable to them.

Goals

Provide relevant, understandable, and insightful crime patterns using several different visualization and statistical methods. Deliver clear and concise graphs and charts that help readers easily interpret and compare data. Utilize statistical learning techniques to examine data, interpret statistically significant factors, and understand associations between variables. Since the data used for this project is largely categorical, this project focuses on techniques such as the Chi-Squared Test and Logistic Regression.

Data

Overview

The Fairfax County Police Department (FCPD) operates in eight police districts, each of which is divided into smaller Police Service Areas (PSAs). In 2023, the FCPD made more arrests and issued more citations than in the prior year. There were more assault offenses and more theft and larceny offenses. However, there were fewer homicide offenses, sex offenses, robberies, and fatal crashes. FCPD makes this information and its corresponding data available on its website (FCPD Open Data Portal, n.d.). Three datasets were pulled from the Fairfax County Police Department website. They covered arrests, citations, and warnings in the year 2023. For simplicity, general definitions are provided:

  • Arrest - When a person is taken into custody to answer for an offense or when there is a deprivation or restraint of a person’s liberty in any significant way.

  • Citation - Formal notice issued by a law enforcement officer for a violation of law, typically related to traffic laws or other minor offenses. It typically requires the violator to appear in court or pay a fine.

  • Warning - When a violation, typically minor, has been made but an officer issues a warning rather than a citation.

The datasets included between 24 and 34 variables, but many of the variables were redundant or not applicable to the research, e.g. web_address, phone_number, name (FCPD Open Data Portal, n.d.). Redundant and non-applicable columns were removed (Table 1). Additional columns, e.g. Hour of Day and Day of Month, were created by parsing information from the original attributes.

Table 1: Attribute table

  Column Name   Data Type   Description
  Date          Date        Date of violation
  Time          Chr         Time of violation
  Offense       Chr         Description of violation
  Gender        Chr         Gender of violator
  Ethnicity     Chr         Hispanic or Non-Hispanic
  District      Chr         Administrative area
  Latitude      Dbl         Coordinates measuring north/south of the equator
  Longitude     Dbl         Coordinates measuring east/west of the prime meridian
  Outcome       Chr         Result of violation: arrest, citation, or warning
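
The derived columns described above (Hour of Day and Day of Month) could be produced along the following lines. This is only a minimal sketch: it assumes `Time` is stored as an "HH:MM" string and `Date` is already of class Date, and the toy rows and the new column names `HourOfDay` and `DayOfMonth` are illustrative rather than taken from the FCPD data.

```r
# Toy records; the real FCPD column formats may differ (assumption: "HH:MM" strings).
records <- data.frame(
  Date = as.Date(c("2023-01-15", "2023-06-30")),
  Time = c("05:42", "17:05"),
  stringsAsFactors = FALSE
)

# Hour of day: the characters before the colon, as an integer from 0 to 23.
records$HourOfDay <- as.integer(sub(":.*$", "", records$Time))

# Day of month: formatted from the Date column, as an integer from 1 to 31.
records$DayOfMonth <- as.integer(format(records$Date, "%d"))

print(records)
```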

Limitations and Assumptions

Due to the nature of the data available from the Fairfax County Police Department website, which is largely categorical, analysis was limited to techniques for qualitative data. The approach taken for the project focused on predicting qualitative responses, i.e. classification. This means that each record pulled from the FCPD is assigned to a category or class.

While understanding local crime is the goal of this project, the data acquired only accounts for crime that was recorded by FCPD. It does not take into account crimes that went unreported or that were reported through channels other than FCPD.

Cleaning and Transformation

Where applicable, duplicate and null values were removed. To address questions related to gender, the data needed to be standardized and correctly categorized. Column names needed to be consistent across the three datasets before merging, so Gender was used in place of Sex. Next, the column values were transformed to consistent labels, e.g. Male, Female, and Other/Unknown. Records marked unverified or unknown were also removed in cases where they did not contribute to the overall analysis. Before removal, their proportion was examined to ensure that dropping them would not affect the overall results; it was typically less than 2%. The research question determined which datasets were merged.
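
The cleaning steps described above might look roughly like this in base R. The data frames, raw gender codes, and the name `warnings_df` are hypothetical stand-ins; the actual FCPD extracts and the exact recoding rules used in the project are not shown in this report.

```r
# Toy stand-ins for two of the FCPD extracts; column names and raw codes are assumptions.
citations <- data.frame(Sex = c("M", "F", "U", "M"), stringsAsFactors = FALSE)
warnings_df <- data.frame(Gender = c("Female", "male", "F"), stringsAsFactors = FALSE)

# 1. Use one column name everywhere so the datasets can be stacked.
names(citations)[names(citations) == "Sex"] <- "Gender"

# 2. Tag each source with its outcome, then combine.
citations$Outcome <- "Citation"
warnings_df$Outcome <- "Warning"
combined <- rbind(citations, warnings_df)

# 3. Map the raw codes onto consistent labels using their first letter.
code <- toupper(substr(combined$Gender, 1, 1))
combined$Gender <- ifelse(code == "M", "Male",
                   ifelse(code == "F", "Female", "Other/Unknown"))

# 4. Check how much would be lost before dropping the unknown records.
unknown_share <- mean(combined$Gender == "Other/Unknown")
cleaned <- combined[combined$Gender != "Other/Unknown", ]
```

In this toy example the unknown share is far above the roughly 2% seen in the project; the check itself is the point.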

Research Questions

  1. Is there an association between gender and warnings?

  2. How do time and location factor into the odds of a citation?

  3. How do enforcement outcomes vary across districts and demographic groups?

Research & Analysis

Question 1: Is there an association between gender and warnings?

To address this question, the null and alternative hypotheses are established.

Null Hypothesis: There is no association between gender and violation outcome, warning or citation. This would mean that the likelihood of a violator getting a warning is independent of gender.

Alternative Hypothesis: There is an association between gender and violation outcome. This implies that gender affects the outcome of whether a violator is given a citation or warning.

According to the cleaned and combined dataset for warnings and citations, there were a total of 88,320 records. The counts for each outcome show that FCPD issued far more citations (64,135) than warnings (24,185). The stacked bar chart also shows that males have a higher count in both categories.

Next, the warning rate for each gender is calculated. This is the probability of a male or female violator receiving a warning instead of a citation, i.e. getting out of a ticket. To calculate the warning rate, the number of warnings is divided by the total number of incidents.

\[ \text{Warning Rate} = \frac{\text{Number of Warnings}}{\text{Total Incidents (Warnings + Citations)}} \]

This shows a slight difference in proportion between the two genders, with females having a higher warning rate than males. In other words, a larger share of female violators received a warning instead of a citation, even though males received more warnings in absolute terms. Is this difference significant, or is it a result of chance or other factors? To help understand these results, the Chi-Square Test of Independence is used. It helps determine whether the variables, gender and outcome, are independent or whether there is a relationship between them.
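
As a quick check of the warning rates, the calculation can be reproduced from the gender-by-outcome counts reported in this section. The object names are illustrative.

```r
# Counts from the report's gender-by-outcome contingency table.
counts <- matrix(c(20478,  8777,
                   43657, 15408),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Outcome = c("Citations", "Warnings")))

# Warning rate = warnings / (warnings + citations), computed per gender.
warning_rate <- counts[, "Warnings"] / rowSums(counts)
round(warning_rate, 3)
# Female is about 0.300 and Male about 0.261
```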

To implement the Chi-Square Test, a contingency table is generated, which shows the distribution for gender and outcome.

\[ \chi^2 = \sum \frac{(O-E)^2}{E} \]

  Gender   Citations   Warnings
  Female      20,478      8,777
  Male        43,657     15,408

The results of the Chi-Square test show:

  • Chi-Square Statistic (X-squared): The chi-square test statistic is 150.62. This measures the discrepancy between the observed frequencies of citations and warnings and the frequencies expected if there were no association between gender and outcome. The expected and observed counts are compared in the table below.

    Gender   Expected Citations   Observed Citations   Expected Warnings   Observed Warnings
    Female               21,244               20,478               8,011               8,777
    Male                 42,891               43,657              16,174              15,408

  • Degrees of Freedom (df): The test has 1 degree of freedom, which is (number of rows − 1) × (number of columns − 1) = (2 − 1) × (2 − 1).

  • p-value: The p-value is less than 2.2e-16, which is much smaller than 0.05. This represents the probability of observing a chi-square statistic of 150.62 or larger if the null hypothesis were true.
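
The reported statistic can be reproduced from the contingency table with base R's `chisq.test`. Note that for a 2×2 table this function applies Yates' continuity correction by default, which appears to be what yields 150.62; the uncorrected formula gives a slightly larger value of about 150.82. The object names are illustrative.

```r
# Observed counts from the gender-by-outcome table.
observed <- matrix(c(20478,  8777,
                     43657, 15408),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   Outcome = c("Citations", "Warnings")))

# Default call: Yates' continuity correction is applied for 2x2 tables.
test <- chisq.test(observed)
test$statistic   # about 150.62
test$parameter   # degrees of freedom: 1
test$p.value     # far below 0.05

# Expected counts under independence: (row total * column total) / grand total.
round(test$expected)

# Same test without the continuity correction, i.e. the plain sum of (O - E)^2 / E.
uncorrected <- chisq.test(observed, correct = FALSE)
uncorrected$statistic
```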

To illustrate this, a heatmap is generated showing each cell's contribution to the chi-square statistic. It shows that the female/warning cell made the largest contribution.
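
The per-cell contributions behind such a heatmap can be sketched as the squared Pearson residuals, (O − E)² / E, expressed as a share of their total. This is an assumed reconstruction of the calculation, not the project's original plotting code.

```r
# Observed counts from the gender-by-outcome table.
observed <- matrix(c(20478,  8777,
                     43657, 15408),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   Outcome = c("Citations", "Warnings")))

# Pearson residuals are (O - E) / sqrt(E); squaring gives each cell's contribution.
fit <- chisq.test(observed, correct = FALSE)
contribution <- fit$residuals^2

# Percentage of the (uncorrected) statistic contributed by each cell.
share <- round(100 * contribution / sum(contribution), 1)
share
```

The female/warning cell has the smallest expected count, so the same absolute deviation weighs most heavily there.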

Thus the null hypothesis is rejected. The results show that there is a statistically significant association between gender and violation outcome in Fairfax County. In other words, the Chi-square test indicates that the likelihood of a violation outcome is significantly associated with gender.

Question 2: How do time and location factor into the odds of a citation?

For this research question, variables such as time of day, progression of week, day of month, and district are examined to determine whether they influence an officer’s decision to issue a warning or citation. Arrests are excluded, since they are assumed to involve a higher threshold of danger that leaves less room for officer discretion.

First, to explore the likelihood of getting a citation versus a warning (outcome of a violation), the citation rate is calculated, using the following equation.

\[ \text{Citation Rate} = \frac{\text{Number of Citations}}{\text{Total Incidents (Warnings + Citations)}} \]

Then, a heatmap is generated to see if there are any patterns that would indicate trends over hours of the day and throughout the month.
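
The matrix behind an hour-by-day heatmap can be built as sketched below. The data here are randomly generated placeholders and the column names `HourOfDay`, `DayOfMonth`, and `Outcome` are assumptions; only the aggregation step is meant to be illustrative.

```r
# Simulated stops; these values are placeholders, not FCPD data.
set.seed(1)
n <- 500
stops <- data.frame(
  HourOfDay  = factor(sample(0:23, n, replace = TRUE), levels = 0:23),
  DayOfMonth = factor(sample(1:31, n, replace = TRUE), levels = 1:31),
  Outcome    = sample(c("Citation", "Warning"), n, replace = TRUE, prob = c(0.7, 0.3))
)

# The mean of a TRUE/FALSE citation indicator within each hour/day cell equals
# citations / (citations + warnings) for that cell. Empty cells come back as NA.
citation_rate <- tapply(stops$Outcome == "Citation",
                        list(Hour = stops$HourOfDay, Day = stops$DayOfMonth),
                        mean)

dim(citation_rate)   # 24 hours by 31 days
```

The resulting matrix can then be passed to a plotting function such as base R's `image()` to draw the heatmap.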

The higher-intensity, darker areas represent a higher citation rate. During the hours of 0500-0600 (24-hour clock), the citation rate is much higher across most days of the month. This would imply that officers are less likely to issue warnings between 5am and 6am. This time also aligns with morning rush hour, when higher volumes of traffic are expected. However, the same pattern of higher citation rates is not observed during evening rush hour. The heatmap also does not show higher intensities at the end of the month, which is what we would expect to see if officers were attempting to meet monthly ticket quotas.

To visualize citations and warnings geospatially, a map is generated using the latitude and longitude associated with each record. While this map is helpful for understanding concentrations of citations and warnings, it is misleading in showing the most dominant outcomes. This is likely due to the quantity of data points overlaid on one another; however, since the map is interactive, areas can be zoomed in on for further insight.

Next, logistic regression is used to examine the odds ratios between citations and warnings for Fairfax County. The predictor variables (time of day, week progression, day of month, and district) are used to predict the probability that a warning or citation will occur. The results give odds of 0.532 at the intercept, which indicates that the baseline odds of getting a warning are below even (1.0). In other words, if pulled over for a traffic violation, it is not a flip of a coin whether you get a ticket or not: according to the intercept of this logistic regression model, the chance of getting a citation is higher. Additionally, the model showed that certain districts (Mount Vernon, Mason, and Franconia) increased the likelihood of getting a warning, and that progression of the week influences leniency. For weekly progression, an odds ratio of 1.595 shows that as the week progresses from Monday to Sunday, the odds of receiving a warning increase by 59.5%. In contrast, factors such as time of day and day of the month showed an increased likelihood of a citation. The graph below depicts the top five odds ratios for citations and the bottom five odds ratios for warnings.
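
As a worked illustration of how these figures can be read, odds convert to a probability via p = odds / (1 + odds). Assuming the intercept describes a stop at the reference level of every predictor, the reported odds of 0.532 correspond to a warning probability of roughly 35%:

\[ p_{\text{warning}} = \frac{\text{odds}}{1 + \text{odds}} = \frac{0.532}{1 + 0.532} \approx 0.347 \]

Similarly, the percentage change in odds implied by an odds ratio is (OR − 1) × 100%, which is how the weekly-progression figure is obtained:

\[ (1.595 - 1) \times 100\% = 59.5\% \]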

Gender and race were originally included but were removed to isolate the time and space variables. After running both models, the Area Under the Curve (AUC) increased only slightly (+0.0004) after the removal of race and gender. This suggests that while gender and race were statistically significant, they were not as strong predictors as date/time and district when classifying an outcome as either citation or warning. The AUC for the model that focused on date/time and district was 0.6289. This means that, given one randomly chosen citation and one randomly chosen warning, the model ranks the pair correctly about 63% of the time, compared with 50% for random guessing. The remaining uncertainty likely reflects factors not captured in this model, such as offense type, the violator’s behavior, and officer variables. While not a particularly strong predictive model, it helps in understanding the effects of time and location.
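
The modeling and evaluation steps can be sketched end to end on simulated data. Everything below is hypothetical: the variable names, the three districts "A" to "C", and the effect sizes are invented for illustration, and the AUC is computed with a rank-based formula rather than whichever package the project used.

```r
# Simulated stops with a made-up early-morning effect and district effect.
set.seed(42)
n <- 5000
sim <- data.frame(
  hour     = sample(0:23, n, replace = TRUE),
  district = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
sim$early <- as.integer(sim$hour %in% 5:6)
log_odds  <- -0.6 - 1.0 * sim$early + 0.5 * (sim$district == "B")
sim$warning <- rbinom(n, 1, plogis(log_odds))

# Binary logistic regression: warning (1) versus citation (0).
fit <- glm(warning ~ early + district, data = sim, family = binomial)

# Exponentiated coefficients are odds ratios (the intercept gives baseline odds).
odds_ratios <- exp(coef(fit))

# Rank-based AUC: the probability that a randomly chosen warning receives a
# higher predicted probability than a randomly chosen citation (ties count half).
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

model_auc <- auc(fitted(fit), sim$warning)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.6289 should be read.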

Question 3: How do enforcement outcomes vary across districts and demographic groups?

Finally, we’ll be exploring the relationships between the outcomes of crime incidents and their district and demographic (gender, race, ethnicity) variables. Addressing this final question involved combining the three data sets of different crime outcomes: arrest data, citation data, and warning data.

Starting with the outcomes across different districts, we perform a chi-square test to establish whether there is a relationship between the crime outcomes and the districts in which they took place.

Null Hypothesis: Enforcement outcome is independent of police district.

Alternative hypothesis: Enforcement outcome and police district are associated.

For the chi-square implementation, a contingency table for the district and outcome counts is generated:

              
               Arrest Citation Warning
  BRADDOCK       1221     7647    2645
  DRANESVILLE     881     5055    2088
  FRANCONIA      6011     6534    3499
  HUNTER MILL    2001     6244    2474
  MASON          5833     5490    2678
  MOUNT VERNON   3040     3662    2451
  PROVIDENCE     9315     4772    1941
  SPRINGFIELD    1176     9909    2672
  SULLY          2300    14854    3758
  Unverified      356     1127     154

From both the visualization above and the table output, we can see that citations are the most common outcome in most districts; Providence and Mason, where arrests outnumber citations, are the exceptions. These unequal totals do not affect the test, as chi-square allows unequal marginal totals. Arrests and warnings also vary substantially in volume across districts. Some districts (like Sully, Springfield, and Braddock) show particularly high citation counts, while others (Dranesville, Mount Vernon) show lower overall enforcement volumes.

The test is then performed on the contingency table above.

The chi-square statistic is extremely large, which indicates a substantial deviation between observed and expected counts. The p-value is also low enough that we reject the null hypothesis. Therefore, there is a statistically significant association between district and enforcement outcome.

From this, we can see that enforcement severity varies by district and district-level patterns justify further modeling.
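
The district-level test can be rerun directly from the contingency table shown earlier in this section. The counts are copied from that table; the object names are illustrative, and the row-percentage step is an added convenience rather than part of the original analysis.

```r
# District-by-outcome counts copied from the contingency table in this section.
district_counts <- matrix(c(
  1221,  7647, 2645,
   881,  5055, 2088,
  6011,  6534, 3499,
  2001,  6244, 2474,
  5833,  5490, 2678,
  3040,  3662, 2451,
  9315,  4772, 1941,
  1176,  9909, 2672,
  2300, 14854, 3758,
   356,  1127,  154),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    District = c("BRADDOCK", "DRANESVILLE", "FRANCONIA", "HUNTER MILL", "MASON",
                 "MOUNT VERNON", "PROVIDENCE", "SPRINGFIELD", "SULLY", "Unverified"),
    Outcome = c("Arrest", "Citation", "Warning")))

# Chi-square test of independence between district and outcome.
district_test <- chisq.test(district_counts)
district_test$parameter   # degrees of freedom: (10 - 1) * (3 - 1) = 18
district_test$p.value

# Row percentages make the district mix easier to compare than raw counts.
outcome_share <- round(100 * prop.table(district_counts, margin = 1), 1)
outcome_share
```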

To test the effects of both the district and demographic variables, we fit a multinomial logistic regression:

Call:
multinom(formula = Outcome ~ Race + Gender + District + Ethnicity, 
    data = combined_data)

Coefficients:
         (Intercept)     RaceB    RaceI    RaceU      RaceW    GenderM
Citation    6.446043 -1.270532 1.902943 2.020289 -0.1401901 -0.5121848
Warning    -2.979146 -1.162020 1.981047 1.571784 -0.0923412 -0.6626383
           GenderO  GenderU  GenderX DistrictDRANESVILLE DistrictFRANCONIA
Citation  6.010004 7.558657 9.927126          -0.0666923         -1.475278
Warning  -4.276199 9.096641 8.356495           0.1233499         -1.022930
         DistrictHUNTER MILL DistrictMASON DistrictMOUNT VERNON
Citation          -0.6712792     -1.629807           -1.3727811
Warning           -0.5462693     -1.210128           -0.7197438
         DistrictPROVIDENCE DistrictSPRINGFIELD DistrictSULLY
Citation          -2.389145           0.3690253     0.1267567
Warning           -2.237610           0.1084139    -0.1961090
         DistrictUnverified EthnicityB EthnicityH EthnicityN EthnicityU
Citation         -0.7239367   2.511770  -4.582736  -3.813774  -2.729531
Warning          -1.6738146  -2.361734   3.525458   4.685802   5.843154

Std. Errors:
         (Intercept)      RaceB     RaceI     RaceU      RaceW    GenderM
Citation  0.03690439 0.03664586 0.2797341 0.1046626 0.03631312 0.01796448
Warning   0.04097933 0.04157964 0.2865267 0.1097935 0.04063792 0.02044004
              GenderO   GenderU   GenderX DistrictDRANESVILLE DistrictFRANCONIA
Citation 1.384588e-08 0.1736266 0.5231826          0.04893589        0.03673167
Warning  1.161816e-09 0.1736270 0.5231557          0.05426427        0.04186914
         DistrictHUNTER MILL DistrictMASON DistrictMOUNT VERNON
Citation          0.04137716    0.03725741           0.04078278
Warning           0.04710735    0.04304069           0.04551201
         DistrictPROVIDENCE DistrictSPRINGFIELD DistrictSULLY
Citation         0.03681733          0.04456369    0.03906339
Warning          0.04385473          0.05022441    0.04459307
         DistrictUnverified   EthnicityB EthnicityH EthnicityN EthnicityU
Citation         0.07127717 9.699283e-08 0.02127470 0.01745390 0.03459567
Warning          0.10481264 9.677125e-08 0.02480225 0.01935154 0.03690519

Residual Deviance: 213967.8 
AIC: 214055.8 

Performing multinomial logistic regression shows that enforcement outcomes vary systematically across both police districts and demographic groups. With arrest as the baseline outcome, certain districts, such as Providence and Mason, have large negative coefficients for both citation and warning, meaning higher relative odds of arrest compared to citations or warnings. Race and ethnicity are also significant predictors: Black and Hispanic individuals are more likely to be arrested rather than receive less severe enforcement actions. Gender has smaller but detectable effects. These results indicate that enforcement outcomes are influenced by both location and demographic characteristics.
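
The coefficients in the multinom output above are log-odds relative to the baseline outcome (Arrest, which has no coefficient row) and the reference district (Braddock, which has no coefficient column). Exponentiating them gives odds ratios, as in this small sketch using two districts' values copied from the output.

```r
# Coefficients copied from the multinom output (log-odds versus Arrest, relative to Braddock).
coefs <- rbind(
  PROVIDENCE = c(Citation = -2.389145, Warning = -2.237610),
  SULLY      = c(Citation =  0.1267567, Warning = -0.1961090)
)

# Odds ratios: values below 1 mean the outcome is less likely, relative to arrest,
# than in the reference district; values above 1 mean it is more likely.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

For example, the Providence citation coefficient corresponds to an odds ratio of roughly 0.09, i.e. the odds of a citation rather than an arrest are about one eleventh of those in Braddock, holding the other predictors fixed.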

Conclusion

Through the research conducted in this project, we were able to examine law enforcement outcomes in Fairfax County using data from 2023. We were able to reject our null hypothesis, which stated that there was no association between gender and violation outcome. Results from the Chi-Square test showed a statistically significant association between gender and outcome. However, while gender was statistically significant, it was found to be a weak predictor after running the logistic regression models. Factors such as time and location had more influence on the odds of a warning versus a citation. The Area Under the Curve (AUC) for the logistic regression model was about 0.63, which, considering that the model focused specifically on time and space, emphasizes the importance of time and location patterns for crime prediction. If factors such as offense type and description were used, this would very likely have resulted in a stronger predictive model. While prediction was limited, this project highlights time and space trends: the odds of receiving a citation were higher between 5am and 6am, while the odds of receiving a warning increased as the week progressed. Also notable was the higher likelihood of a warning in the Mount Vernon, Mason, and Franconia districts.

References

  1. Fairfax County Police Department (FCPD) Open Data Portal. (n.d.). FCPD Annual Reports. https://www.fcpod.org/pages/annual-reports

  2. Google. (2025). Gemini (Flash 2.5) [Large language model]. Retrieved from https://gemini.google.com/

  3. Thevapalan, A. (2024, August 29). Chi-Square Test in R: A Complete Guide. DataCamp. https://www.datacamp.com/tutorial/chi-square-test-r

  4. Chugh, V. (2023, March 17). Logistic Regression in R Tutorial. DataCamp. https://www.datacamp.com/tutorial/logistic-regression-R

  5. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R (1st ed. 2013.). Springer New York. https://doi.org/10.1007/978-1-4614-7138-7